14 research outputs found

    Programming models to support data science workflows

    Data Science workflows have become a must to progress in many scientific areas such as life, health, and earth sciences. In contrast to traditional HPC workflows, they are more heterogeneous, combining binary executions, MPI simulations, multi-threaded applications, custom analysis (possibly written in Java, Python, C/C++ or R), and real-time processing. Furthermore, whereas in the past field experts were capable of programming and running small simulations, nowadays simulations requiring hundreds or thousands of cores are widely used, and programming them efficiently has become a challenge even for computer scientists. Thus, programming languages and models make a considerable effort to ease programmability while maintaining acceptable performance. This thesis contributes to the adaptation of High-Performance frameworks to support the needs and challenges of Data Science workflows by extending COMPSs, a mature, general-purpose, task-based, distributed programming model. First, we enhance our prototype to orchestrate different frameworks inside a single programming model so that non-expert users can build complex workflows where some steps require highly optimised state-of-the-art frameworks. This extension includes the @binary, @OmpSs, @MPI, @COMPSs, and @MultiNode annotations for both Java and Python workflows. Second, we integrate container technologies to enable developers to easily port, distribute, and scale their applications to distributed computing platforms. This combination provides a straightforward methodology to parallelise applications from sequential codes, along with efficient image management and application deployment that ease the packaging and distribution of applications. We distinguish between static, HPC, and dynamic container management and provide representative use cases for each scenario using Docker, Singularity, and Mesos.
Third, we design, implement and integrate AutoParallel, a Python module that automatically finds an appropriate task-based parallelisation of affine loop nests and executes them in parallel in a distributed computing infrastructure. It is based on sequential programming and requires one single annotation (the @parallel Python decorator), so that anyone with intermediate-level programming skills can scale up an application to hundreds of cores. Finally, we propose a way to extend task-based management systems to support continuous input and output data, enabling the combination of task-based workflows and dataflows (Hybrid Workflows) using one single programming model. Hence, developers can build complex Data Science workflows with different approaches depending on the requirements, without the effort of combining several frameworks at the same time. Also, to illustrate the capabilities of Hybrid Workflows, we have built a Distributed Stream Library (DistroStreamLib) that can be easily integrated with existing task-based frameworks to provide support for dataflows. The library provides a homogeneous, generic, and simple representation of object and file streams in both Java and Python, enabling complex workflows to handle any data type without dealing directly with the streaming back-end. Postprint (published version)
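
The idea of turning a sequential loop into independent tasks, as described for AutoParallel above, can be illustrated with a minimal sketch. It is not AutoParallel's algorithm or API: the function names, the fixed block size, and the use of a local thread pool in place of a distributed runtime are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def vector_add_sequential(a, b):
    """Plain sequential loop: the kind of code a user would write first."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out


def add_block(a, b, start, stop):
    """One task: the loop body applied to the index range [start, stop)."""
    return [a[i] + b[i] for i in range(start, stop)]


def vector_add_taskified(a, b, block_size=4, workers=4):
    """Split the iteration space into blocks and run each block as a task.

    Hypothetical stand-in: a thread pool plays the role of the runtime.
    Iterations are independent here, so no ordering between blocks is needed.
    """
    n = len(a)
    bounds = [(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(add_block, a, b, s, e) for s, e in bounds]
        out = []
        for future in futures:  # collect results in block order
            out.extend(future.result())
    return out
```

The block size controls task granularity: larger blocks mean fewer, coarser tasks and less scheduling overhead, at the cost of less available parallelism.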

    Enabling Analytic and HPC Workflows with COMPSs

    In the recent joint venture between the High-Performance Computing (HPC) and Big-Data (BD) ecosystems towards Exascale Computing, the scientific community has realized that powerful programming models and high-level abstraction tools are a must. Within this context, the Barcelona Supercomputing Center (BSC) is developing the COMP Superscalar (COMPSs) programming model, whose main objective is to let users develop applications sequentially while the Runtime System handles the inherent parallelism of the application and abstracts the programmer from the underlying infrastructures. The parallelism is achieved by defining an application Interface that allows COMPSs to detect methods that operate on a set of parameters (called tasks) and execute them transparently in a distributed fashion. This Master's Thesis aims to enhance COMPSs, adapting it to the needs of the Big-Data ecosystems, by supporting Analytic and HPC workflows. To this end, we propose a straightforward integration with the execution of binaries and of MPI and OmpSs applications. Although the COMPSs programming model is kept untouched, we extend the COMPSs Annotations and some of the COMPSs internals, such as the task schedulers and the worker executors. To support our contribution, we have ported two real use cases to COMPSs: on the one hand, NMMB BSC-Dust, a workflow to predict the atmospheric life cycle of desert dust and, on the other hand, Guidance, an integrated solution for Genome and Phenome association analysis.
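
As a rough illustration of what integrating binary executions into a task-based model can look like, the sketch below wraps an external program behind an ordinary function call. The `binary` decorator defined here is hypothetical and written only for this example; it is not the COMPSs annotation, and it runs the program locally rather than through a runtime.

```python
import subprocess
import sys
from functools import wraps


def binary(executable):
    """Hypothetical decorator: call arguments become the command line."""
    def decorate(fn):
        @wraps(fn)
        def run(*args):
            cmd = [executable, *map(str, args)]
            # check=True raises CalledProcessError on a non-zero exit code.
            done = subprocess.run(cmd, capture_output=True, text=True, check=True)
            return done.stdout.strip()
        return run
    return decorate


# The Python interpreter stands in for an arbitrary external binary so that
# the example runs anywhere.
@binary(sys.executable)
def python_binary(*args):
    """Body intentionally empty: the external program does the work."""


def main():
    return python_binary("-c", "print(6 * 7)")
```

The design point is that the caller sees a normal function, so a workflow can mix such wrapped binaries with native code without changing its structure.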

    Transparent Orchestration of Task-based Parallel Applications in Containers Platforms

    This paper presents a framework to easily build and execute parallel applications in container-based distributed computing platforms in a user-transparent way. The proposed framework is a combination of the COMP Superscalar (COMPSs) programming model and runtime, which provides a straightforward way to develop task-based parallel applications from sequential codes, and container management platforms that ease the deployment of applications in computing environments (such as Docker, Mesos or Singularity). This framework provides scientists and developers with an easy way to implement parallel distributed applications and deploy them in a one-click fashion. We have built a prototype which integrates COMPSs with different container engines in different scenarios: i) a Docker cluster, ii) a Mesos cluster, and iii) Singularity in an HPC cluster. We have evaluated the overhead in the building, deployment and execution phases of two benchmark applications, compared to a Cloud testbed based on KVM and OpenStack and to the usage of bare metal nodes. We have observed an important gain in comparison to cloud environments during the building and deployment phases. This enables better adaptation of resources with respect to the computational load. In contrast, we detected an extra overhead during the execution, which is mainly due to the multi-host Docker networking. This work is partly supported by the Spanish Government through Programa Severo Ochoa (SEV-2015-0493), by the Spanish Ministry of Science and Technology through the TIN2015-65316 project, by the Generalitat de Catalunya under contracts 2014-SGR-1051 and 2014-SGR-1272, and by the European Union through the Horizon 2020 research and innovation program under grant 690116 (EUBra-BIGSEA Project). Results presented in this paper were obtained using the Chameleon testbed supported by the National Science Foundation. Peer Reviewed. Postprint (author's final draft)

    Transparent execution of task-based parallel applications in Docker with COMP Superscalar

    This paper presents a framework to easily build and execute parallel applications in container-based distributed computing platforms in a user-transparent way. The proposed framework is a combination of COMP Superscalar and Docker. We have built a prototype and measured its overhead in the building, deployment and execution phases. We have observed an important gain compared with cloud environments during the building and deployment phases. In contrast, we have detected an extra overhead during the execution, which is mainly due to the multi-host Docker networking. This work is partly supported by the Spanish Government through contracts SEV-2015-0493 and TIN2015-65316-P, by the Generalitat de Catalunya under contracts 2014-SGR-1051 and 2014-SGR-1272, and by the European Union under grants 676556 (MuG Project) and 690116 (EUBra-BIGSEA Project). Results presented in this paper were obtained using the Chameleon testbed supported by the NSF. Peer Reviewed. Postprint (author's final draft)

    Enabling Python to execute efficiently in heterogeneous distributed infrastructures with PyCOMPSs

    Python has been adopted as a programming language by a large number of scientific communities. In addition to its easy programming interface, the many libraries and modules made available by a large number of contributors have taken this language to the top of the list of the most popular programming languages in scientific applications. However, one main drawback of Python is the lack of support for concurrency or parallelism. PyCOMPSs is a proven approach to support task-based parallelism in Python that enables applications to be executed in parallel in distributed computing platforms. This paper presents PyCOMPSs and how it has been tailored to execute tasks in heterogeneous and multi-threaded environments. We present an approach to combine the task-level parallelism provided by PyCOMPSs with the thread-level parallelism provided by MKL. Performance and behavioral results in distributed computing heterogeneous clusters show the benefits and capabilities of PyCOMPSs in both HPC and Big Data infrastructures. This work has been supported by the Spanish Government (SEV-2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), and by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272). Javier Conejero's postdoctoral contract is co-financed by the Ministry of Economy and Competitiveness under Juan de la Cierva Formación postdoctoral fellowship number FJCI-2015-24651. Cristian Ramon-Cortes's predoctoral contract is financed by the Ministry of Economy and Competitiveness under contract BES-2016-076791. This work is supported by the Intel-BSC Exascale Lab. This work has been supported by the European Commission through the Horizon 2020 Research and Innovation program under contract 687584 (TANGO project). Peer Reviewed. Postprint (author's final draft)

    COMP Superscalar, an interoperable programming framework

    COMPSs is a programming framework that aims to facilitate the parallelization of existing applications written in Java, C/C++ and Python scripts. For that purpose, it offers a simple programming model based on sequential development, in which the user is mainly responsible for identifying the functions to be executed as asynchronous parallel tasks and marking them with annotations or standard Python decorators. A runtime system is in charge of exploiting the inherent concurrency of the code, automatically detecting and enforcing the data dependencies between tasks and spawning these tasks to the available resources, which can be nodes in a cluster, clouds or grids. In cloud environments, COMPSs provides scalability and elasticity features allowing the dynamic provision of resources. This work has been supported by the following institutions: the Spanish Government with grant SEV-2011-00067 of the Severo Ochoa Program and contract Computacion de Altas Prestaciones VI (TIN2012-34557); by the SGR programme (2014-SGR-1051) of the Catalan Government; by the project The Human Brain Project, funded by the European Commission under contract 604102; by the ASCETiC project funded by the European Commission under contract 610874; by the EUBrazilCloudConnect project funded by the European Commission under contract 614048; and by the Intel-BSC Exascale Lab collaboration. Peer Reviewed. Postprint (published version)
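
The programming style described above (sequential code, functions marked as asynchronous tasks, dependencies resolved by a runtime) can be mimicked in a few lines of standard-library Python. This is a sketch under stated assumptions, not the COMPSs runtime or its API: the `task` decorator and `wait_on` helper are hypothetical names defined here, dependencies are tracked only through futures passed as arguments, and a local thread pool replaces distributed resources.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from functools import wraps

_pool = ThreadPoolExecutor(max_workers=4)


def task(fn):
    """Hypothetical task marker: calling the function returns a Future."""
    @wraps(fn)
    def submit(*args):
        def run():
            # An argument that is a Future is the output of an earlier task:
            # waiting for it here enforces the data dependency.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return fn(*resolved)
        return _pool.submit(run)
    return submit


def wait_on(value):
    """Synchronise: block until a task result is available in the main code."""
    return value.result() if isinstance(value, Future) else value


@task
def square(x):
    return x * x


@task
def add(a, b):
    return a + b


def main():
    # Reads like sequential code; the two squares may run concurrently,
    # while add() starts only after both of its inputs are ready.
    a = square(3)
    b = square(4)
    c = add(a, b)
    return wait_on(c)
```

The main program never names threads or nodes; it only states which functions are tasks and where it needs a concrete value.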

    The impact of non-additive genetic associations on age-related complex diseases

    Genome-wide association studies (GWAS) are not fully comprehensive, as current strategies typically test only the additive model, exclude the X chromosome, and use only one reference panel for genotype imputation. We implement an extensive GWAS strategy, GUIDANCE, which improves genotype imputation by using multiple reference panels and includes the analysis of the X chromosome and non-additive models to test for association. We apply this methodology to 62,281 subjects across 22 age-related diseases and identify 94 genome-wide associated loci, including 26 previously unreported. Moreover, we observe that 27.7% of the 94 loci are missed if we use standard imputation strategies with a single reference panel, such as HRC, and only test the additive model. Among the new findings, we identify three novel low-frequency recessive variants with odds ratios larger than 4, which need at least a three-fold larger sample size to be detected under the additive model. This study highlights the benefits of applying innovative strategies to better uncover the genetic architecture of complex diseases. Most genome-wide association studies assume an additive model, exclude the X chromosome, and use one reference panel. Here, the authors implement a strategy including non-additive models and find that the number of loci for age-related traits increases as compared to the additive model alone. Peer reviewed
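
To make the distinction between additive and non-additive models concrete, the sketch below shows the textbook way a genotype (0, 1 or 2 copies of the minor allele) is recoded before an association test. The abstract does not list which non-additive models GUIDANCE tests, so the dominant and recessive codings here are standard illustrations rather than a description of that tool; the function name is an assumption.

```python
def encode_genotype(minor_allele_count, model):
    """Recode a genotype for an association test under a given genetic model.

    minor_allele_count: 0, 1 or 2 copies of the minor allele.
    model: "additive", "dominant" or "recessive" (textbook codings).
    """
    if minor_allele_count not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2 minor-allele copies")
    if model == "additive":
        # Effect grows linearly with each extra copy.
        return minor_allele_count
    if model == "dominant":
        # One copy is enough to show the effect.
        return 1 if minor_allele_count >= 1 else 0
    if model == "recessive":
        # Effect appears only with two copies.
        return 1 if minor_allele_count == 2 else 0
    raise ValueError(f"unknown model: {model}")
```

Under the recessive coding, heterozygous carriers are grouped with non-carriers, which is why a purely recessive effect at a low-frequency variant is harder to detect when only the additive coding is tested.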

    Multicentre, randomised, open-label, phase IV–III study to evaluate the efficacy of cloxacillin plus fosfomycin versus cloxacillin alone in adult patients with methicillin-susceptible Staphylococcus aureus bacteraemia: study protocol for the SAFO trial

    SAFO study group and the Spanish Network for Research in Infectious Diseases (REIPI). [Introduction] Methicillin-susceptible Staphylococcus aureus (MSSA) bacteraemia is a frequent condition, with high mortality rates. There is a growing interest in identifying new therapeutic regimens able to reduce the therapeutic failure and mortality observed with the standard of care of beta-lactam monotherapy. In vitro and small-scale studies have found synergy between cloxacillin and fosfomycin against S. aureus. Our aim is to test the hypothesis that cloxacillin plus fosfomycin achieves higher treatment success than cloxacillin alone in patients with MSSA bacteraemia. [Methods] We will perform a superiority, randomised, open-label, phase IV–III, two-armed parallel group (1:1) clinical trial at 20 Spanish tertiary hospitals. Adults (≥18 years) with isolation of MSSA from at least one blood culture ≤72 hours before inclusion and with evidence of infection will be randomly allocated to receive either cloxacillin 2 g/4-hour intravenous plus fosfomycin 3 g/6-hour intravenous or cloxacillin 2 g/4-hour intravenous alone for 7 days. After the first week, sequential treatment and total duration of antibiotic therapy will be determined according to clinical criteria by the attending physician. Primary endpoints: (1) Treatment success at day 7, a composite endpoint comprising all the following criteria: patient alive, stable or with improved quick-Sequential Organ Failure Assessment score, afebrile and with negative blood cultures for MSSA at day 7. (2) Treatment success at test of cure (TOC) visit: patient alive and no isolation of MSSA in blood culture or at another sterile site from day 8 until TOC (12 weeks after randomisation). We assume a rate of treatment success of 74% in the cloxacillin group.
Accepting an alpha risk of 0.05 and a beta risk of 0.2 in a two-sided test, 183 subjects will be required in each of the control and experimental groups to detect a statistically significant difference of 12% (considered clinically significant). [Ethics and dissemination] Ethical approval has been obtained from the Ethics Committee of Bellvitge University Hospital (AC069/18) and from the Spanish Medicines and Healthcare Product Regulatory Agency (AEMPS, AC069/18), and is valid for all participating centres under existing Spanish legislation. The results will be presented at international meetings and will be made available to patients and funders. [Trial registration number] The protocol has been approved by AEMPS with the Trial Registration Number EudraCT 2018-001207-37. ClinicalTrials.gov Identifier: NCT03959345; Pre-results. The SAFO trial is supported by a competitive grant awarded by the Fondo de Investigaciones Sanitarias at the Spanish government’s National Institute of Health Research, Instituto de Salud Carlos III (ISCIII) (FIS PI17/01116). This study was supported by Plan Nacional de I+D+i 2017–2021 and Instituto de Salud Carlos III, Subdirección General de Redes y Centros de Investigación Cooperativa, Ministerio de Economía, Industria y Competitividad, Spanish Network for Research in Infectious Diseases (REIPI RD16/0016/0005). Peer reviewed
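
The sample-size statement above (74% success in the control arm, a 12-point difference, alpha 0.05, beta 0.2, two-sided) can be approximated with a standard two-proportion calculation. The abstract does not say which formula the protocol used, so the sketch below is an assumption: it applies the common unpooled normal approximation, with no continuity correction and no allowance for losses to follow-up. With these inputs it yields 171 per group, somewhat below the protocol's 183, which presumably reflects a different method or additional adjustments not given in the abstract.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_experimental, alpha=0.05, beta=0.2):
    """Subjects per arm to compare two proportions (two-sided test).

    Unpooled normal approximation:
        n = (z_{1-alpha/2} + z_{1-beta})^2 * (p1*q1 + p2*q2) / (p2 - p1)^2
    No continuity correction or dropout inflation is applied.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(1 - beta)
    variance = (p_control * (1 - p_control)
                + p_experimental * (1 - p_experimental))
    delta = p_experimental - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)


if __name__ == "__main__":
    # 74% assumed success with cloxacillin alone, 86% with the combination.
    print(sample_size_per_group(0.74, 0.86))
```

Because the required size scales with the inverse square of the difference, halving the detectable difference roughly quadruples the number of subjects needed.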

    Multicentre, randomised, open-label, phase IV-III study to evaluate the efficacy of cloxacillin plus fosfomycin versus cloxacillin alone in adult patients with methicillin-susceptible Staphylococcus aureus bacteraemia: study protocol for the SAFO trial

    Introduction: Methicillin-susceptible Staphylococcus aureus (MSSA) bacteraemia is a frequent condition, with high mortality rates. There is a growing interest in identifying new therapeutic regimens able to reduce the therapeutic failure and mortality observed with the standard of care of beta-lactam monotherapy. In vitro and small-scale studies have found synergy between cloxacillin and fosfomycin against S. aureus. Our aim is to test the hypothesis that cloxacillin plus fosfomycin achieves higher treatment success than cloxacillin alone in patients with MSSA bacteraemia. Methods: We will perform a superiority, randomised, open-label, phase IV-III, two-armed parallel group (1:1) clinical trial at 20 Spanish tertiary hospitals. Adults (≥18 years) with isolation of MSSA from at least one blood culture ≤72 hours before inclusion and with evidence of infection will be randomly allocated to receive either cloxacillin 2 g/4-hour intravenous plus fosfomycin 3 g/6-hour intravenous or cloxacillin 2 g/4-hour intravenous alone for 7 days. After the first week, sequential treatment and total duration of antibiotic therapy will be determined according to clinical criteria by the attending physician. Primary endpoints: (1) Treatment success at day 7, a composite endpoint comprising all the following criteria: patient alive, stable or with improved quick-Sequential Organ Failure Assessment score, afebrile and with negative blood cultures for MSSA at day 7. (2) Treatment success at test of cure (TOC) visit: patient alive and no isolation of MSSA in blood culture or at another sterile site from day 8 until TOC (12 weeks after randomisation). We assume a rate of treatment success of 74% in the cloxacillin group. Accepting an alpha risk of 0.05 and a beta risk of 0.2 in a two-sided test, 183 subjects will be required in each of the control and experimental groups to detect a statistically significant difference of 12% (considered clinically significant).
Ethics and dissemination: Ethical approval has been obtained from the Ethics Committee of Bellvitge University Hospital (AC069/18) and from the Spanish Medicines and Healthcare Product Regulatory Agency (AEMPS, AC069/18), and is valid for all participating centres under existing Spanish legislation. The results will be presented at international meetings and will be made available to patients and funders.

    Disseny, implementació i integració d'una mà per a un robot Darwin-OP [Design, implementation and integration of a hand for a Darwin-OP robot]

    This project aims to give the Darwin-OP robot the ability to grasp small objects through the design, implementation and integration of an anthropomorphic hand (forearm, wrist, hand and fingers). It is considered very important that the final prototype be quick to build and low-cost. First, the current context of robotics is briefly presented, focusing particularly on the development of robotic arms and small humanoids. A comparison is also made between the robot used in this project and its direct competitor, Nao. Next, the design process of the hand, carried out with 3D CAD software (SolidWorks), is detailed. The end of that section pays special attention to the design of the final prototype, detailing its drawings so that Darwin-OP users can reuse it. Almost in parallel with this last process, the implementation process is explained, which was carried out with an ABS 3D printer. In particular, the advantages and drawbacks of this method are discussed, along with the simplicity of assembling the hand. The construction process then closes with the hardware and software integration of the prototype into the Darwin-OP robot. This section presents the Robotis Dynamixel XL-320 servomotors used, specifies how they are installed to attach the hand to the robot, and details the controller code that lets Darwin-OP recognise and communicate with the servomotors. The tests carried out to verify the correct operation of the prototype are also included. Finally, it is confirmed that the prototype built meets the initial requirements: it gives Darwin-OP the ability to grasp small objects, is quick to build and is low-cost. Possible extensions of the project are also listed, such as adding pressure sensors to the fingers.